
Remove remaining uses of FFI under -fpure-haskell #660

Merged (4 commits) on Feb 15, 2024

Conversation

@clyring (Member) commented Feb 14, 2024

All of these were standard C functions that GHC's JS backend actually somewhat supports; their shims can be found in the compiler source at "rts/js/mem.js". But it seems simpler to just get rid of all FFI uses with -fpure-haskell rather than try to keep track of which functions GHC supports.

The pure Haskell implementation of memcmp runs about 6-7x as fast as the simple one-byte-at-a-time implementation for long equal buffers, which makes it... about the same speed as the pre-existing shim, even though the latter is also a one-byte-at-a-time implementation!

Apparently GHC's JS backend is not yet able to produce efficient code for tight loops like these; the biggest problem is that it does not perform any loopification, so each iteration must go through a generic-call indirection.

Unfortunately that means that this patch probably makes 'strlen' and 'memchr' much slower with the JS backend.

(I noticed this situation while working on #569.)

(This is based on top of #659 to avoid pointless CPP.)
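For reference, the one-byte-at-a-time strategy that both the naive fallback and the pre-existing shim use can be sketched in JavaScript roughly as follows. This is a hypothetical illustration, not the actual code from rts/js/mem.js; the function name and buffer representation are assumptions.

```javascript
// Hypothetical byte-at-a-time memcmp over Uint8Array views, in the
// spirit of the shims in rts/js/mem.js (not the actual shim code).
// Compares n bytes starting at the given offsets and returns a
// negative, zero, or positive number, like C's memcmp.
function memcmpBytewise(a, aOff, b, bOff, n) {
  for (let i = 0; i < n; i++) {
    const d = a[aOff + i] - b[bOff + i];
    if (d !== 0) return d;
  }
  return 0;
}
```

Every iteration here does one load per buffer and one compare; the faster pure Haskell implementation mentioned above gains by doing more work per loop iteration, but still pays the backend's per-iteration call overhead.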

@clyring (Member, Author) commented Feb 14, 2024

cc @hsyl20

@hsyl20 (Contributor) commented Feb 15, 2024

@luite told me it would be a big performance hit for JS. Or we'd need to figure out how to optimize recursive functions like these first.

As an alternative we could perhaps define GHC primops for common libc operations? That would make bytestring pure Haskell too.

@luite (Member) commented Feb 15, 2024

> @luite told me it would be a big performance hit for JS. Or we need to figure out how to optimize recursive functions like these before.
>
> As an alternative we could perhaps define GHC primops for common libc operations? That would make bytestring pure Haskell too.

Or perhaps something provided by base or ghc-internal

@doyougnu commented

Alternatively, if GHC performed loopification via join points (https://gitlab.haskell.org/ghc/ghc/-/issues/14068), then we would get it for free.

@clyring (Member, Author) commented Feb 15, 2024

We definitely should have a primop for memcmp; I dunno about strlen or memchr. But anyway the JS backend really should be able to optimize simple Haskell loops like these to something about as good as the shims currently in rts/js/mem.js.

There actually is a version of strlen in base, namely cstringLength#. It has the wrong type for use in I/O or on mutable buffers, but I guess we could use the same lazy runRW# hack as in deferForeignPtrAvailability to force it to be well-sequenced.

@clyring (Member, Author) commented Feb 15, 2024

> Alternatively, if GHC performed loopification via join points: https://gitlab.haskell.org/ghc/ghc/-/issues/14068 then we would get it for free.

It would be relatively trivial to write an StgToStg pass that performs loopification using join points. I can prepare a GHC patch in a few weeks if you like.

@clyring (Member, Author) commented Feb 15, 2024

It looks like the tail calls of join points get turned into trampolines, which is better than the status quo for non-join-pointed tail calls but still not as good as we'd like for basic self-loops like these, which should be very efficiently implementable by wrapping them in `while(true){ /* ... */ break; }` and using `continue` to self-tail-call. Does that sound feasible? If so, let's make a feature ticket on the GHC tracker. (I know that the continuation-wrangling of a "real" function call might get in the way, but there are no such obstacles in this memcmp implementation.)
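The difference between the two strategies can be sketched in plain JavaScript. This is a hypothetical illustration, not actual GHC output; the function names are invented, and strlen over a NUL-terminated buffer stands in for the self-loop.

```javascript
// Trampolined style: each "tail call" allocates a continuation that
// a driver loop then invokes, so every iteration pays a closure
// allocation plus an indirect call.
function strlenTramp(buf, i) {
  if (buf[i] === 0) return { done: true, value: i };
  return { done: false, next: () => strlenTramp(buf, i + 1) };
}
function runTramp(step) {
  while (!step.done) step = step.next();
  return step.value;
}

// Direct loop style: the self-tail-call becomes `continue` inside
// while(true), with no per-iteration allocation or indirection.
function strlenLoop(buf, i) {
  while (true) {
    if (buf[i] === 0) return i;
    i = i + 1;
    continue;
  }
}
```

Both compute the same result; the direct form is what the proposed self-loop lowering would emit for a known self-tail-call.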

Also, we should be able to just trampoline for known exactly-saturated function calls in tail position instead of producing generic-call code, even if we are not calling a join point. Right?


In any case, my inclination is to just accept these performance regressions for now.

@doyougnu commented

opened: https://gitlab.haskell.org/ghc/ghc/-/issues/24442

> my inclination is to just accept these performance regressions for now.

sounds good to me.

@clyring clyring merged commit 305604c into haskell:master Feb 15, 2024
26 checks passed
clyring added a commit to clyring/bytestring that referenced this pull request Feb 15, 2024
clyring added a commit that referenced this pull request Feb 15, 2024
* WIP: Prepare changelog for 0.12.1.0

* fiddle with CI

* Revert "fiddle with CI"

This reverts commit 3e22005.

* More changelog updates

* Mention `pure-haskell` flag in Changelog.md

* Add hidden entry for #660
clyring added a commit that referenced this pull request Feb 15, 2024
(cherry picked from commit 305604c)
clyring added a commit that referenced this pull request Feb 15, 2024
(cherry picked from commit 314e257)